Introduction*

Japanese was one of the languages selected for
evaluation of named entity identification algorithms in
the TIPSTER-sponsored Multilingual Entity Task
(MET) program. As with the Spanish and Chinese
groups (Table 1), Japanese systems automatically marked
the names of organizations, people, and places within
entity name expressions (ENAMEX), dates and times
within time expressions (TIMEX), and percents and
money within number expressions (NUMEX). The
participant Japanese systems were developed in a
four-month period and output results comparable to
the Message Understanding Conference-6 (MUC-6) [1]
English language systems, with F-Measures between 70%
and 90% [2].
Table 1: MET Participants


The  Corpus 

Since  MET  was  designed  to  tackle  the  MUC-6 
named  entity  task  in  foreign  languages,  the  government 
needed  to  acquire  a  corpus  of  articles  rich  in  references  to 
people, places, and organizations. A search of Kyodo
newswire data using the Japanese keyword for
"press conference" yielded the desired 100-article
development and 100-article test corpora. In the test
corpus,  71%  of  the  tags  were  of  the  ENAMEX  type;  that 
is,  the  tagged  items  were  references  to  organizations, 
people, and places (Table 2). By contrast, the
150-article TIPSTER Phase I test corpus contained
only 75 instances of person names, or just 1% of all
corpus tags [3].
Table 2: Distribution of Tag Types


Human  Performance 

One  motivation  for  conducting  the  named  entity  task 
in  a  foreign  language  such  as  Japanese  was  to  promote 
techniques  for  tackling  language-specific  difficulties  in 
recognizing  the  names  of  people  and  organizations. 
Unlike  English,  Japanese  cannot  rely  upon  orthographic 
clues  like  capitalization  to  identify  proper  nouns.  For 
this  reason,  and  based  upon  the  authors'  own  manual 
tagging  experience,  we  felt  that  identification  of 
ENAMEX  types  would  be  the  most  challenging  to  the 
participant systems. (See Hard Tag Type: ENAMEX below.)

The  government  MET  Japanese  team  was  accorded 
the  opportunity  to  test  this  hypothesis  during  the  course 
of  preparing  dry-run  keys  for  the  initial  systems  test  in 
early  April.  Each  of  us  manually  tagged  the  same  25 
articles,  then  looked  at  the  resulting  annotator  variation. 
The  discrepancies  between  the  two  sets  of  tagged  data 
were discussed and resolved. The final product formed
the ground truth, or keys, against which the automatic
participant systems were scored. Each of our original
manually  tagged  versions  was  also  scored  against  the 
keys.  The  figures  in  Table  3  are  similar  due  to  the  small 
number  of  articles  and  lack  of  degradation  in  human 
performance  over  the  short  period  of  time  it  took  for  each 
of  us  to  tag  (average:  <  5  minutes  per  article). 
Nonetheless,  the  ORGANIZATION  and  LOCATION 
subtypes  were  the  most  prone  to  error. 
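
Scoring a tagged version against the keys can be illustrated with a minimal sketch. It assumes exact-match scoring (a response tag counts as correct only when its span and subtype both match a key tag) and the balanced F-Measure, F = 2PR / (P + R). The official MET scoring software may have awarded partial credit, so the representation of a tag as a (start, end, subtype) triple and the function name below are illustrative assumptions only.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical tag representation: (start offset, end offset, subtype).
Tag = Tuple[int, int, str]


class Score(NamedTuple):
    precision: float
    recall: float
    f_measure: float


def score_against_key(response: Set[Tag], key: Set[Tag]) -> Score:
    """Exact-match scoring of a set of response tags against the key tags."""
    correct = len(response & key)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return Score(precision, recall, 0.0)
    # Balanced F-Measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return Score(precision, recall, f_measure)


# Invented example: one ORGANIZATION is mistagged as LOCATION, one DATE is missed.
key = {(0, 2, "LOCATION"), (5, 9, "ORGANIZATION"), (12, 15, "PERSON"), (20, 23, "DATE")}
response = {(0, 2, "LOCATION"), (5, 9, "LOCATION"), (12, 15, "PERSON")}
result = score_against_key(response, key)
```

In this invented example two of the three response tags are correct, so precision is 2/3 and recall is 2/4.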


* This material has been reviewed by the CIA. That review
neither  constitutes  CIA  authentication  of  information  nor  implies 
CIA  endorsement  of  the  author's  views. 
Table 3: Human Performance


Table  4  shows  the  group  average  F-Measures  for  the 
participant  systems  against  both  the  dry-run  and  test 
keys. The government's intuitive assumption concerning
the relative difficulty of identifying ENAMEX types
(people, places, and organizations) was borne out.
Table 4: Systems Performance

Easy  Tag  Types:  NUMEX  &  TIMEX 


While  the  entity  name  expressions  were  relatively 
difficult  to  handle,  the  number  (NUMEX)  and  time 
(TIMEX)  expressions  encompassing  the  tag  subtypes 
PERCENT  &  MONEY  and  TIME  &  DATE, 
respectively,  were  handled  proficiently  by  the  participant 
systems.  As  Table  4  shows,  the  group  average  F- 
Measure  for  these  tag  types  was  over  90%  on  the  test 
data. 

PERCENT 

As  in  English,  the  typical  Japanese  contextual 
pattern  for  generating  a  valid  PERCENT  tag  was  an 
Arabic numeral + the "%" sign, e.g., "10%." The fact
that the systems collectively scored less than 100%
(96%) indicates that this pattern was not universal.
Indeed, the Japanese development and test articles
represented percentages in various other ways, such as an
Arabic numeral + the kanji (Chinese character) for
percent, e.g., "7割"; an Arabic numeral + the katakana
(foreign loan-word script) for percent, e.g.,
"70パーセント"; and a kanji numeral + the kanji for
percent, e.g., "七割." In addition, the MET Japanese
Guidelines [4] stipulated that fractions such as
"10分の1" (1/10), which are easily calculable as
percentages, should also be identified.
Although  the  above-mentioned  patterns  are  more  varied 
than  what  one  typically  encounters  in  English  texts,  they 
nevertheless  constitute  a  standard  finite  list  which  the 
participant  systems  processed  well. 
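
Because the PERCENT patterns form a finite list, they lend themselves to a pattern-matching sketch such as the one below. This is not any participant's implementation. The choice of "割" as the kanji percent marker, the set of kanji numerals, and the function name are assumptions made for illustration.

```python
import re

# Assumed building blocks: half- or full-width Arabic numerals and a small
# set of kanji numerals.
ARABIC = r"[0-9０-９]+(?:[.．][0-9０-９]+)?"
KANJI_NUM = r"[〇一二三四五六七八九十百千]+"
NUMBER = rf"(?:{ARABIC}|{KANJI_NUM})"

PERCENT_PATTERN = re.compile(
    rf"{NUMBER}分の{NUMBER}"                # fraction, e.g. 10分の1 (1/10)
    rf"|{NUMBER}(?:%|％|パーセント|割)"     # numeral + percent marker
)


def find_percents(text):
    """Return every substring that matches one of the PERCENT patterns."""
    return [m.group(0) for m in PERCENT_PATTERN.finditer(text)]


sample = "前年比10%増、全体の7割、70パーセント、七割、10分の1"
percents = find_percents(sample)
```

A list-driven pattern like this is easy to extend when a new marker turns up in the development data, which may be one reason this tag type was handled well.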

MONEY  and  TIME 

The contextual patterns manifested in the
development and test corpora for representing MONEY
and TIME were also limited. Typically, MONEY was
specified by an Arabic numeral + a monetary unit in
katakana, e.g., "2億ドル" ($200 million). The
occasional mistake made by the systems involved not
identifying a monetary unit other than the predominant
dollar or yen, such as "ポンド" (British pound).
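
The MONEY error described above is essentially a coverage gap in the list of monetary units. The sketch below shows how such a list might drive the pattern. The unit list, the magnitude characters, and the function name are illustrative assumptions and are not drawn from the MET Guidelines.

```python
import re

# Hypothetical unit list; a system that knew only ドル (dollar) and 円 (yen)
# would miss ポンド (pound), as described in the text.
MONETARY_UNITS = ["ドル", "円", "ポンド", "マルク", "フラン"]

MONEY_PATTERN = re.compile(
    r"[0-9０-９][0-9０-９,，.．]*"      # Arabic numeral, half- or full-width
    r"[万億兆]*"                        # optional magnitude: 10^4, 10^8, 10^12
    r"(?:" + "|".join(map(re.escape, MONETARY_UNITS)) + r")"
)


def find_money(text):
    """Return every numeral + monetary unit expression in the text."""
    return [m.group(0) for m in MONEY_PATTERN.finditer(text)]


amounts = find_money("総額2億ドル、約500万円、3000ポンド")
```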

TIME expressions were also straightforward in their
manner of representation. Examples of valid tags
included "0930," "午前" (morning), "午後5時"
(5 PM), etc. An anomalous string such as
"午前零時過ぎ" (past midnight), in which
the kanji "零" was used rather than the numeral "0,"
caused problems for some systems.
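
One way to avoid the kanji-zero problem is to let the hour position accept kanji numerals as well as Arabic ones. The following is a sketch under that assumption. It covers only clock expressions built on "時" and the bare morning/afternoon markers, not four-digit strings such as "0930", and the character set and function name are hypothetical.

```python
import re

# Assumed digit inventory: Arabic numerals plus kanji numerals, including 零 (zero).
DIGITS = "0-9０-９〇零一二三四五六七八九十"

TIME_PATTERN = re.compile(
    rf"(?:午前|午後)?[{DIGITS}]+時(?:[{DIGITS}]+分|半)?(?:過ぎ)?"
    r"|午前|午後"   # bare "morning" / "afternoon"
)


def find_times(text):
    """Return clock-time expressions, with or without a morning/afternoon prefix."""
    return [m.group(0) for m in TIME_PATTERN.finditer(text)]


times = find_times("午後5時に会見し、午前零時過ぎに終了した")
```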

DATE 

The  Japanese  participant  systems  processed  DATE 
expressions  successfully  despite  the  demands  made  by  the 
MET  Japanese  Guidelines  [4]  concerning  what  should  be 
tagged  and  the  wealth  of  patterns  used  to  represent  those 
expressions. In addition to absolute DATEs such as
"26日" (26th) and "火曜日" (Tuesday), the
MET Guidelines stipulated that relative DATEs such as
"来年7月" (next year July) were to be
identified as well.

The requirement for tagging relative dates,
furthermore, introduced to this task a class of Japanese
semantic attachments that complicated the identification
process. Whereas "来" (next, coming) and "昨"
(last) are in the initial position of the phrases "来年"
(next year) and "昨年" (last year), other semantic
attachments such as "末" (end of) or "初め" (early,
beginning of), which add semantic content to the DATE
expression and are, therefore, integral parts of the DATE
tag, are in the postposition of phrases. Examples are
"月末" (end of the month) and "5月初め"
(early May). It was not uncommon, therefore, to
encounter in these texts semantic-laden relative DATE
expressions containing both prepositions and
postpositions, e.g., "来年3月末" (end of
March next year). In the end, the Japanese systems
identified DATEs with an extensive number of different
semantic attachments at the average rate of 94%.
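
The interaction of prefixed and postfixed attachments can be sketched as a single pattern in which the relative prefix is folded into the year or month element and the suffix is an optional trailing element that stays inside the tag. The inventories of prefixes and suffixes below are small illustrative assumptions, far short of the range found in the corpora.

```python
import re

NUM = r"[0-9０-９]{1,4}"
# Prefixed attachments folded into the year and month elements (assumed list).
YEAR = rf"(?:来年|今年|昨年|去年|{NUM}年)"
MONTH = rf"(?:来月|今月|先月|{NUM}月)"
DAY = rf"{NUM}日"
WEEKDAY = r"[月火水木金土日]曜日?"
# Postfixed attachments that belong inside the DATE tag (assumed list).
SUFFIX = r"(?:末|初め)"

DATE_PATTERN = re.compile(
    rf"(?:{YEAR}{MONTH}?(?:{DAY})?|{MONTH}(?:{DAY})?|{DAY}|{WEEKDAY})(?:{SUFFIX})?"
    r"|月末|年末"   # bare "end of the month" / "end of the year"
)


def find_dates(text):
    """Return DATE expressions including any leading or trailing attachment."""
    return [m.group(0) for m in DATE_PATTERN.finditer(text)]


dates = find_dates("来年3月末に26日の会談を受けて、火曜日に発表した")
```

Keeping the suffix inside the match reflects the requirement that attachments such as "end of" are integral parts of the DATE tag.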

Hard  Tag  Type:  ENAMEX 

LOCATION 

LOCATION expressions were typified by entities
that likely would be contained in a gazetteer or similar
on-line resource. References to "イスラエル"
(Israel) or "米国" (U.S.), for example, were readily
identified by the systems. Other semantic clues such as
the locative designators "県" (prefecture) or "州"
(state) assisted in recognizing more obscure place names.

This task was complicated, however, by the
prevalence of embedded LOCATION elements within
ORGANIZATION expressions and the effects of context
upon tag type. References to LOCATION frequently
appeared within phrases that might or might not
subsume the LOCATION under another tag. For
instance, "U.S.-Japan trade negotiations" would be an
event not captured by a single tag, but by two
LOCATION tags for U.S. and Japan. However, the
reference to "米" (U.S.) in "米国務省" (U.S.
Dept. of State) was considered an integral part of the
ORGANIZATION name and not, therefore, segmented
and tagged separately. Determining when to segment and
not segment the place sub-component was a
complicating factor in producing the proper tag type.
Furthermore, a correctly identified ORGANIZATION
such as "国会" (Diet) was directed by the Guidelines
to be tagged as LOCATION when the context in which it
was used indicated that the Diet was a facility or
structure, i.e., if a press conference were being held
there. The inability of systems to handle this complex
contextual shift lowered the group F-Measure average for
LOCATION.
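
The segmentation issue can be illustrated with a small gazetteer lookup in which known ORGANIZATION names take precedence over any place name they contain. The word lists and function names are hypothetical, and the sketch does not attempt the contextual shift from ORGANIZATION to LOCATION described for the Diet.

```python
# Hypothetical resources: a tiny gazetteer and a tiny organization list.
GAZETTEER = {"イスラエル", "米国", "日本", "エジプト"}
ORGANIZATIONS = {"米国務省"}


def find_all(text, name):
    """Return (start, end) spans for every occurrence of name in text."""
    spans, start = [], text.find(name)
    while start != -1:
        spans.append((start, start + len(name)))
        start = text.find(name, start + 1)
    return spans


def tag_locations(text, gazetteer=GAZETTEER, organizations=ORGANIZATIONS):
    """Tag organizations first, then only the places that lie outside them."""
    org_spans = [span for name in organizations for span in find_all(text, name)]
    tags = [(s, e, "ORGANIZATION") for s, e in org_spans]
    for place in gazetteer:
        for start, end in find_all(text, place):
            overlaps = any(s < end and start < e for s, e in org_spans)
            if not overlaps:
                tags.append((start, end, "LOCATION"))
    return sorted(tags)


tags = tag_locations("米国務省は日本との交渉を発表した")
```

Here the gazetteer entry "米国" matches the first two characters of "米国務省", but it is suppressed because it overlaps the organization span, while "日本" is tagged normally.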

PERSON 

Although there is no ready-made on-line resource for
person names like a gazetteer for place names (and even
if there were, its enormous size would substantially slow
processing speed), valid PERSON expressions often
contain people designators in the form of titles, roles, or
positions such as "氏" (Mr.), "議長" (chairman), or
"大統領" (president), as in "ムバラク大統領"
(President Mubarak).

And, although the number of different designators
manifested throughout the corpora was sizable, the
participant systems did well at identifying person names
occurring in this type of pattern.

There were, however, other patterns in evidence
which conflicted with the predominant one. Preeminent
among these alternate patterns was the expression in
which a title/position designator was preceded by both a
PERSON and LOCATION, e.g., "ムバラク・エジプト大統領"
(literally: Mubarak Egypt President). Systems typically
tagged the entire phrase preceding the title ("Mubarak
Egypt") as PERSON, or Egypt alone as LOCATION.
Therefore, the person name was either mistagged or not
tagged at all.
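
A designator-driven pattern that allows an optional country between the name and the title is one way to handle both constructions. The sketch below assumes that person names are written in katakana, that an optional middle dot may separate name and country, and that the desired output is a PERSON tag for the name plus a separate LOCATION tag for the country. The word lists are hypothetical.

```python
import re

# Hypothetical designator and country lists.
TITLES = ["大統領", "首相", "議長", "氏"]
COUNTRIES = ["エジプト", "イスラエル", "日本", "米"]


def _alternation(words):
    # Longest alternatives first so that longer entries win over their prefixes.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


PERSON_PATTERN = re.compile(
    r"(?P<person>[ァ-ヶー]+?)"                            # katakana name, lazy
    rf"(?:・?(?P<location>{_alternation(COUNTRIES)}))?"   # optional country
    rf"(?P<title>{_alternation(TITLES)})"                 # title designator
)


def tag_people(text):
    """Return (string, tag type) pairs for names found before a title."""
    tags = []
    for m in PERSON_PATTERN.finditer(text):
        tags.append((m.group("person"), "PERSON"))
        if m.group("location"):
            tags.append((m.group("location"), "LOCATION"))
    return tags


simple = tag_people("ムバラク大統領")
with_country = tag_people("ムバラク・エジプト大統領")
```

The lazy name match lets the country alternative claim its characters before the name can absorb them. A known limitation of this sketch is that a katakana country followed directly by a title, with no person name, would be tagged as PERSON.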

ORGANIZATION 

Identifying and categorizing complex noun phrases
in strings where there is neither capitalization nor
whitespace makes this type of expression the most
difficult to process (group F-Measure average 73%).
Normally, the corporate designator "社" (Co., Corp.)
would assist in identifying an ORGANIZATION. However,
the MET domain focused on political rather than
commercial entities, so there were very few instances of
this designator. And, although bureaucratic descriptors
like "省" indicate Japanese ministries, often well-known
ministries such as "通商産業省" (MITI) are aliased
("通産省") without mention of the canonical form.


In addition, the most prevalent entities properly
identified as ORGANIZATION in these texts included
groups, offices, labs, etc., that is, noun phrases which
could be proper nouns depending upon context. For
example, "宮田工場" could be the name of a
particular factory, Miyata Factory, or a generic factory
located in Miyata; similarly, "新兵庫銀行"
could be the New Hyogo Bank, the new (e.g., rebuilt)
Hyogo Bank, or a new Hyogo Bank (i.e., one bank in the
Hyogo Bank chain).

To complicate matters further, once a complex NP
like the Japanese phrase for "MITI
Telecommunications Subcommittee" is determined to be
a  proper  noun,  the  systems  next  were  required  to  tag  as 
ORGANIZATION  each  constituent  part  of  the 
hierarchical  relationship  expressed  within  the  phrase.  In 
this  case,  there  were  two:  MITI  (parent)  and 
Telecommunications  Subcommittee  (child). 
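
The parent/child requirement can be sketched as splitting a compound name after each organizational designator. The designator list and the example string "通産省電気通信小委員会" are hypothetical stand-ins and are not taken from the corpus. A single-character designator such as "省" can also occur inside ordinary words, so a real system would need more context than this.

```python
import re

# Hypothetical designators that close an organizational unit.
ORG_DESIGNATORS = ["小委員会", "委員会", "省", "庁"]

_SEGMENT = re.compile(
    r".+?(?:" + "|".join(map(re.escape, ORG_DESIGNATORS)) + r")"
)


def split_hierarchy(noun_phrase):
    """Split a compound organization name into its constituent units.

    If the segments do not cover the whole phrase, the phrase is returned
    unsplit rather than guessing at a partial hierarchy.
    """
    parts = _SEGMENT.findall(noun_phrase)
    if "".join(parts) != noun_phrase:
        return [noun_phrase]
    return parts


# Each returned part would receive its own ORGANIZATION tag.
parts = split_hierarchy("通産省電気通信小委員会")
```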

Summary 

The Japanese systems showed excellent overall
results despite a very compressed development cycle.
They handled comparatively easy types of expressions
with a high (>90%) degree of accuracy, and the hard
expressions with surprising proficiency, thereby
promising marked improvement in the near term and the
capability to work in conjunction with other language
processing technologies such as Machine Translation
(MT) and text summarization [5].